			
			
			
		*GHS 2018 webinar work, public access do file
		* A Kerr, DataFirst 1 April 2020
		clear all
		set more off
		cap log close
		log using "C:\Users\Andrew Kerr\Google Drive\UCT work\DataFirst\Training\Webinars\GHS 2020\ghs2018_webinar.log", replace
		*log using "YOUR FILE PATH HERE\ghs2018_webinar.log", replace
		
		use "C:\Users\Andrew Kerr\Desktop\Andy\DataFirst\GHS\GHS 2018\Data\ghs-2018-house-1.0-stata11.dta"
		*use "YOUR FILE PATH HERE\ghs-2018-house-1.0-stata11.dta"
		rename uqnr UqNr
		drop Personnr
		merge 1:m UqNr using "C:\Users\Andrew Kerr\Desktop\Andy\DataFirst\GHS\GHS 2018\Data\ghs-2018-person-1.0-stata11.dta"
		*merge 1:m UqNr using "YOUR FILE PATH HERE\ghs-2018-person-1.0-stata11.dta"
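		
		*a quick check on the merge, as a sketch: it assumes merge created its default _merge variable
		*every hh should match at least one person, so we expect to see only _merge==3 and a count of zero
		tab _merge
		count if _merge!=3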
		
		*helpful to see num labels when you browse the data
		numlabel, add
		
		count
		*71000 people in the sample.
		*how many households? Could use an egen function, but here we flag the first person in each hh and count the flags
		*hh id variable is UqNr
		sort UqNr
		by UqNr: gen hh_one=1 if _n==1
		count if hh_one==1
		*20988 hh
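		
		*another way to count households, sketched with egen: tag() gives 1 on one row per hh and 0 on all other rows
		*hh_tag is a name made up for this check; the count should equal the count of hh_one
		egen byte hh_tag=tag(UqNr)
		count if hh_tag==1
		drop hh_tag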
		
		*****************MAIN TASK- create a household monthly income variable***************************
		*need to include social grants, earnings from work, remittances and private pensions
		
		
		*********************start with individual level data: grants and earnings********************
	*1. Social grants
		gen oap_g=1 if Q31bOAG==1 
		gen dis_g=1 if Q31bDIS  ==1
		gen cs_g=1 if Q31bCSG ==1
		gen caredep_g=1 if Q31bCAR ==1
		gen foster_g=1 if Q31bFOS ==1
		gen warvet_g=1 if Q31bWVT ==1
		gen gia=1 if Q31bGRN ==1
		gen soc_rel=1 if Q31bSOC ==1
		
		*now income from these grants
		gen oap_g_inc=1600 if oap_g==1
		gen dis_g_inc=1600 if dis_g==1
		gen cs_g_inc=380 if cs_g==1
		gen caredep_g_inc=1600 if caredep_g==1
		gen foster_g_inc=920 if foster_g==1
		gen warvet_g_inc=1620 if warvet_g==1
		gen gia_inc=380 if gia==1
		
		*now we sum up grant income for each person, to get their total grant income (people can get more than 1 grant)
		egen grant_income= rowtotal(oap_g_inc dis_g_inc cs_g_inc caredep_g_inc foster_g_inc warvet_g_inc gia_inc)
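		
		*a more compact way to build the same total, sketched as a loop (one way to cut down the code above)
		*the amounts are the ones used above; grant_income_chk is a name made up for this check
		local grants oap_g dis_g cs_g caredep_g foster_g warvet_g gia
		local amounts 1600 1600 380 1600 920 1620 380
		gen grant_income_chk=0
		local i=1
		foreach g of local grants {
			*pick the amount in the same position as the grant name
			local amt: word `i' of `amounts'
			replace grant_income_chk=grant_income_chk+`amt' if `g'==1
			local ++i
		}
		*this should count zero people if the loop matches the line-by-line version
		count if grant_income_chk!=grant_income
		drop grant_income_chk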
		
		
	*2. employment and income from work
		*use empstatus to identify the employed
		tab employ_Status1
		gen employed=1 if employ_Status1==1
		
		*differentiate formal and informal sector workers (just 1 Q to a worker asking if they work in formal or informal sector)
		tab Q45SEC
		gen employed_formal=1 if employ_Status1==1 & (Q45SEC==1 | Q45SEC==9)
		gen employed_informal=1 if employ_Status1==1 & (Q45SEC==2 | Q45SEC==3 )
		*assuming don't knows are informal, and unspecified = formal
		
		*now income:
		*3 relevant Qs: 
		*total salary or pay at main job: Q42aSTO 
		*what period is this amount above for: Q42bSP  (week, month or year)
		*if you refuse to answer: tell us what bracket you fall into Q43SALC
		*some people refuse to even give a bracket...
		
		*to make this easier I am just going to assume that the monthly earnings variable Stats SA created is correct.
		*that is an important assumption. They have "imputed" earnings for people who would not say how much they earned.
		*if we excluded the income of these people it would be like assuming their earnings were zero, which is wrong.
		*so for now we'll use the Stats SA monthly earnings number Q42msal
		*Stats SA gives the not employed placeholder codes for their earnings: large numbers like 88888888
		*we must exclude these, otherwise the unemployed would appear to have very large incomes...
		gen monthlyearnings=Q42msal if Q42msal<888888
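		
		*to see exactly which values the cutoff above leaves out, tabulate them (a sketch; the codes shown depend on the data release)
		tab Q42msal if Q42msal>=888888 & Q42msal<.
		*missing values (.) are left out too, since Stata treats . as larger than any number
		*so this counts employed people who end up with no earnings value
		count if employed==1 & monthlyearnings==.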
		
		*So now we have individual level incomes from grants and work
		*the other income questions, about remittances and non-grant pensions were asked at hh level.
		*we want a household-level measure of income, so we want to sum all the individual level incomes for each hh, and then add in remittances and pensions
		*remember that our dataset has hh level variables, but each row is an individual
		*so we shift gear a bit to doing things at a hh level. 
		*first we sum up the individual level incomes
		*to do that we use the egen command with by
		sort UqNr
		by UqNr: egen hh_earnings=total(monthlyearnings)
		by UqNr: egen hh_grants_inc=total(grant_income)
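		
		*to see what egen with by has done, list a few rows (a sketch; the 15 rows are an arbitrary choice)
		*the hh totals repeat on every row of the same hh, while the individual amounts differ
		list UqNr monthlyearnings hh_earnings grant_income hh_grants_inc in 1/15, sepby(UqNr)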
		
		*count how many grants each person gets, then sum over the hh to get the number of grants per hh
		egen indiv_grants=rowtotal(oap_g dis_g cs_g caredep_g foster_g warvet_g gia) 
		by UqNr: egen hh_grants=total(indiv_grants)
		
		
		************now we need to add in income asked about only at the hh level ********************
	*3. Remittances from migrants	
		tab Q89bMain
		*this question shows what the hh reports as its main income. 
		*very few hh report other, like rental or interest (probably rich hh). But these incomes are not asked about. 
		sum Q810Rem if Q810Rem<8888888, d
		gen remittances=1 if Q810Rem<8888888
		gen hh_rem_inc=Q810Rem if Q810Rem<8888888
		
	*4. private pension income (ie not the government old age pension)
		sum Q811Pen if Q811Pen< 8888888, d
		gen pension=1 if Q811Pen< 8888888
		gen hh_pension_inc=Q811Pen if Q811Pen< 8888888
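		
		*these are hh level answers repeated on every person's row. A sketch of a check that they do not vary within a hh
		*rem_varies and pen_varies are names made up for this check; both counts should be zero
		by UqNr: gen byte rem_varies=hh_rem_inc!=hh_rem_inc[1]
		by UqNr: gen byte pen_varies=hh_pension_inc!=hh_pension_inc[1]
		count if rem_varies==1
		count if pen_varies==1
		drop rem_varies pen_varies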
		
	*now need to add the individual and hh sources of income together to create a hh income variable


		egen hh_income=rowtotal(hh_earnings hh_grants_inc hh_rem_inc hh_pension_inc)
		*this adds up all the hh sources of income
		
		sum hh_income [aw=house_wgt] if hh_one==1, detail
		count if hh_income==0
		*There are some hh reporting no income (note that the count above is of people in those hh)
		*this is worrying. Why might it happen? People don't report all income, they are scared to tell survey enumerators their income etc.
		*Since I want to get a quick and reasonable sense of how well off households are, rather than a perfect answer...
		*I use the question on household expenditure as an alternative to income for households reporting low income.
		*expenditure Q has 10 options.
		*I assume each hh's expenditure is the midpoint of the bracket.
		*So if they report expenditure of R1 to R199 I assume they had R100 expenditure in the last month
		*then if this expenditure number is larger than the income number I assume the true hh income is the larger expenditure number.
		*not ideal, and some of my teachers would not like this, but given the circumstances it is not a terrible way to do things.
		
		
		tab Q814Exp
		gen hh_expenditure=0 if Q814Exp==1
		replace hh_expenditure=100 if Q814Exp==2
		replace hh_expenditure=300 if Q814Exp==3
		replace hh_expenditure=600 if Q814Exp==4
		replace hh_expenditure=1000 if Q814Exp==5
		replace hh_expenditure=1500 if Q814Exp==6
		replace hh_expenditure=2150 if Q814Exp==7
		replace hh_expenditure=3750 if Q814Exp==8
		replace hh_expenditure=7500 if Q814Exp==9
		replace hh_expenditure=15000 if Q814Exp==10
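		
		*the same bracket-to-midpoint mapping can be written in one line with recode (a sketch; hh_exp_chk is a made-up name)
		recode Q814Exp (1=0) (2=100) (3=300) (4=600) (5=1000) (6=1500) (7=2150) (8=3750) (9=7500) (10=15000) (else=.), gen(hh_exp_chk)
		*this should count zero rows if the two versions agree
		count if hh_exp_chk!=hh_expenditure
		drop hh_exp_chk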
		

		gen hh_inc_exp=hh_income
		replace hh_inc_exp=hh_expenditure if hh_expenditure> hh_income & hh_expenditure<.
		*so hh_exp>hh_inc for about 20% of the sample... 
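		
		*to check that share, flag the rows where expenditure replaced income (a sketch; used_exp is a made-up name)
		gen byte used_exp=hh_expenditure>hh_income & hh_expenditure<.
		*first as a share of people, then as a share of households
		tab used_exp
		tab used_exp if hh_one==1
		drop used_exp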
		count if hh_inc_exp==0
		count if hh_inc_exp==0 & hh_one==1
		*a few hh and people reporting no income or expenditure in the last month.
		
		*now create hh size and per capita income
		by UqNr: egen hh_size=count(PersonNR) 
		gen  hh_inc_exp_pc=hh_inc_exp/hh_size
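		
		*a sketch of a check on hh size: within by, _N is the number of rows in each hh
		*it should match hh_size unless PersonNR has missing values (hh_size_chk is a made-up name)
		by UqNr: gen hh_size_chk=_N
		count if hh_size_chk!=hh_size
		drop hh_size_chk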
		
		*then use xtile command to put people into 1 of 10 categories or deciles (1 = poorest 10%, 2 = next poorest 10% and 10 = richest 10%)
		xtile hh_decile =hh_inc_exp_pc [pw=person_wgt], nquantiles(10) 
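		
		*a sketch of a check on the deciles: with person weights each decile should hold roughly 10% of people
		*(ties in per capita income can make the groups a little uneven)
		tab hh_decile [aw=person_wgt]
		*the min and max per capita income in each decile show where the cut points fall
		tabstat hh_inc_exp_pc [aw=person_wgt], by(hh_decile) stat(min max)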
		
		*next: hh level employment variables
		*to do: try and figure out how to cut down the code above, there is a lot...
		
		*bysort re-sorts by UqNr first, in case an earlier command changed the sort order
		bysort UqNr: egen hh_employed=total(employed) 
		tab  hh_employed if hh_one==1
		
		by UqNr: egen hh_employed_formal=total(employed_formal) 
		tab  hh_employed_formal if hh_one==1
		
		gen hh_empl_form_any=1 if hh_employed_formal>0 & hh_employed_formal<.
		by UqNr: egen hh_employed_informal=total(employed_informal) 
				
		gen inc_cat=1 if hh_grants>0 & hh_grants<. & hh_employed==0
		replace inc_cat=2 if hh_employed_formal>0 & hh_employed_formal<.
		replace inc_cat=3 if hh_employed_formal==0 & hh_grants==0 & hh_employed_informal>0 &hh_employed_informal<.
		replace inc_cat=4 if  hh_grants>0 & hh_employed_informal>0 & hh_employed_formal==0
		replace inc_cat=5 if hh_employed==0 & hh_grants==0 & remittances==1 & pension!=1 
		replace inc_cat=6 if hh_employed==0 & hh_grants==0 &  pension==1 
		replace inc_cat=7 if hh_income==0 
		
		*hmm, need to fix the missings
		tab inc_cat, m
		tab inc_cat if hh_one==1, m
		 tab hh_employed if inc_cat==.
		 tab hh_employed_informal if inc_cat==.
		 tab employed_formal if inc_cat==., m
		 tab employed_informal if inc_cat==., m
		 *so there are employed but neither formal nor informal???
		 *concentrate on the hh_employed=zero  first...
		sum hh_income if inc_cat==., d
		
		label define inc_cat 1 "Grants only" 2 "At least 1 formal sect employee" 3 "Informal sector workers only" ///
			4 "Informal+grant" 5 "Remittances only" 6 "Private pension only" 7 "No income reported"
		label values inc_cat inc_cat
		graph bar [pw=person_wgt], over(inc_cat) over(hh_decile) stack asyvars percentage ///
			title(HH status by income decile) note(Source: Stats SA GHS 2018) ytitle(% of individuals)
		graph export "C:\Users\Andrew Kerr\Google Drive\UCT work\DataFirst\Training\Webinars\GHS 2020\ak_2.png", replace as(png)
		
		*so far we have only used the weights.
		*To get standard errors of means, totals etc that are unbiased, we must also tell Stata what the complex sample design was,
		*specifically the PSUs and strata.
		*It is not THAT important that you understand WHY this is required,
		*but remembering that it is required is VERY important.
		*We do that here. The result is to increase standard errors and confidence intervals, but not by much.
		egen strat=group(stratum )
		svyset PSU [pw=house_wgt], strata(strat)
		svy: total hh_grants if hh_one==1  
		matrix list r(table), format(%16.0g)
		*this is an estimate of the total number of social grants disbursed.
		*Given the small sample it's remarkably close to the 17 million from this article:
		*https://www.fin24.com/Opinion/analysis-heres-how-some-south-africans-are-using-their-social-grants-to-become-entrepreneurs-20190913-2
		*this is the magic of statistics... It doesn't always work, but it is remarkable.
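		
		*two follow-ups, as a sketch, assuming the svy: total above ran and its results are still in memory
		*_b[hh_grants] holds the point estimate, so it can be shown with thousands separators for easier reading
		display %16.0fc _b[hh_grants]
		*svydescribe summarises the strata and PSUs that svyset declared
		svydescribe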
		
		cap log close
